This course, Introduction to Open Data Science 2020, builds on 1) openness of science and data, and 2) data science (e.g., wrangling, examining, analyzing, and visualizing data). Because data science also includes collaborating with others and using openly available software tools and services (e.g., R, RStudio, Git, GitHub), openness is the key word in this course.
The course has 6 main topics:
1. Introduction
2. Regression and model validation
3. Logistic regression
4. Clustering and classification
5. Dimensionality reduction techniques
6. Analysis of longitudinal data
Each week has deadlines for 1) submitting exercises, and 2) peer reviewing submissions from a few of the other students.
The course platform on MOOC contains lots of relevant information: instructions for the course, discussion forums, lecture videos and other videos, exercises, a place for submissions, and links to other helpful resources (including books). Lectures (and smaller group sessions) take place on Mondays!
Overall, this course will introduce ways to become better at interpreting data, presenting findings, and sharing knowledge.
My GitHub repository: https://github.com/PinjaMarin/IODS-project
I really enjoyed the first Zoom webinar and found the DataCamp videos extremely informative. The first part, “Hello R”, clarified how R works, although its content was familiar to me. Most of the topics in the second part, “Getting hooked on R”, were new to me.
I have not used GitHub before so I really appreciated the clear instructions on the course site and the opportunity to look at the webinar regarding GitHub and RStudio.
I am very excited about the upcoming weeks, particularly about getting more familiar with GitHub and learning to use R better. I expect the R part to be time consuming, occasionally frustrating, and rewarding.
#Loading libraries
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyverse)
getwd()
## [1] "C:/Users/Pinja/Documents/GitHub/IODS-project"
#Setting the working directory to the data folder, reading the modified data into R from my local folder, and saving it as lrn2014
setwd('./data')
lrn2014 <- read.csv('learning2014.csv')
#examining the data
dim(lrn2014) #166 observations and 8 variables (one is an additional id-variable)
## [1] 166 8
str(lrn2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
#Reading the data from the web to make sure it is correct
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
dim(lrn14) #the data has 166 observations and 7 variables
## [1] 166 7
str(lrn14) #the variables are: gender, age, attitude, deep, stra, surf, and points
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(lrn14)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
glimpse(lrn14)
## Rows: 166
## Columns: 7
## $ gender <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F"...
## $ age <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...
The data is a modified and shortened version of a longer data set examining students’ approaches to learning and study skills on a statistics course in 2014.
The shortened data contains 166 participants, and the following variables:
The values of age and points are integers; the other numerical variables have decimal values.
glimpse(lrn14) #gender is the first column
## Rows: 166
## Columns: 7
## $ gender <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F"...
## $ age <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...
#The pairs function gave an error about gender having an 'invalid color name'
#Let's try with a recoded version of gender where M=1 and F=2.
lrn142 <- lrn14 %>%
mutate(gender_num = recode(gender, "M"="1", "F"="2"))
#converting the recoded variable to numeric as well
lrn142 <- lrn142 %>%
mutate(gender_num = as.numeric(gender_num))
#(naming the result something other than `mean` avoids masking the base function)
gender_mean <- mean(lrn142$gender_num)
gender_mean #the mean is ~1.66; it would be 1.5 with an equal gender split, so there are more women in the data
## [1] 1.662651
#pairs(lrn142[c(-1)], col=lrn142$gender_num) # I used this to check colours -> red is 2=females, black is 1=males
#a scatter plot of the variables (gender only shown with colours)
pairs(lrn142[c(-1,-8)], col=lrn142$gender_num)
#Let's make a plot with scatter plots, correlations, and distributions
plot_c <- ggpairs(lrn14, 1:7, progress = ggmatrix_progress(clear=T), mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")
plot_c
#Let's make and print the same plot but without genders separated
plot <- ggpairs(lrn14, 2:7, progress = ggmatrix_progress(clear=T), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")
plot
summary(lrn14) #shows the number of observations for character variables, and the range and summary values (such as the mean) for numerical variables
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Based on the plots and summary information:
The three strongest correlations in the whole data are between:
#Fitting a linear model
r3_model <- lm(points ~ attitude + stra + gender, data = lrn14)
summary(r3_model)
##
## Call:
## lm(formula = points ~ attitude + stra + gender, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9798 2.4030 3.737 0.000258 ***
## attitude 3.5100 0.5956 5.893 2.13e-08 ***
## stra 0.8911 0.5441 1.638 0.103419
## genderM -0.2236 0.9248 -0.242 0.809231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
#Strategic learning (stra) and gender were not significantly related to exam points (p > .05, even > .1). At this significance level we cannot reliably reject the t-test's null hypothesis, which states that no relation exists between a particular x variable and the y variable.
#Let's refit the model with only the significant predictor, i.e. attitude.
r1_model <- lm(points ~ attitude, data = lrn14)
summary(r1_model)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The new model shows that, mathematically:
exam points = 11.64 + 3.53 * attitude (+ e)
The intercept is 11.64, meaning that if attitude were 0, this would be the predicted value of y. That is not very meaningful to interpret here (it just shows that y is not 0 when x is 0). However, the x variable can be centered so that its mean is zero; in that case the intercept would be the predicted (mean) y value at the mean of x.
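The centering idea can be sketched with the built-in mtcars data (standing in for the course data here, so the variable names are not the course variables):

```r
# A sketch of centering a predictor, using mtcars purely for illustration.
fit_raw <- lm(mpg ~ wt, data = mtcars)
mtcars$wt_c <- mtcars$wt - mean(mtcars$wt)  # center: mean of wt_c is 0
fit_c <- lm(mpg ~ wt_c, data = mtcars)
# The slope is unchanged, but the intercept of the centered model
# equals the predicted outcome at the mean predictor value, i.e. mean(mpg).
coef(fit_c)[["(Intercept)"]]
```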
The beta value of attitude, 3.53, is quite strong. It shows that for each one-point increase in the attitude variable, the exam points are predicted to go up by about 3.5 points. That is, the more positive one's attitude, the better one is likely to do in the exam.
Both the intercept and attitude are significant at p < .001. This means that H0 is rejected for attitude and H1 can be accepted, i.e. there is a relationship between attitude and points. For the intercept, the significant t-test means that the intercept differs from 0.
In short, of attitude, strategic learning, and gender, only attitude significantly (p < .001) predicted exam points. For each one-point increase in the attitude variable, the exam points are predicted to go up by about 3.5 points.
Other model information from the model summary:
#making a plot to view the final model
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
#Plots for residuals vs fitted values (1), normal QQ-plot (2), and residuals vs. leverage (5)
#(par() is called before plot() to set up the grid; passing it as an argument to plot() was a bug)
par(mfrow = c(2,2))
plot(r1_model, which = c(1,2,5))
In order for its predictions to be valid and reliable, a linear regression model makes (among other things) the following assumptions:
From the plots we see that:
#loading libraries
library(tibble)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(plotly)
library(foreign)
library(tibble)
library(foreign)
library(corrgram)
library(psych)
#Reading in the joined (student alcohol consumption) data from the web (due to a comment in the discussion forum about the correct Ns - I had too many... although this seems to be the case in the online link as well, oh well)
alc <-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)
glimpse(alc) #382 participants, 35 variables.
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
head(alc,6)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## nursery internet guardian traveltime studytime failures schoolsup famsup paid
## 1 yes no mother 2 2 0 yes no no
## 2 no yes father 1 2 0 no yes no
## 3 yes yes mother 1 2 3 yes no yes
## 4 yes yes mother 1 3 0 no yes yes
## 5 yes no father 1 2 0 no yes yes
## 6 yes yes mother 1 2 0 no yes yes
## activities higher romantic famrel freetime goout Dalc Walc health absences G1
## 1 no yes no 4 3 4 1 1 3 6 5
## 2 no yes no 5 3 3 1 1 3 4 5
## 3 no yes no 4 3 2 2 3 3 10 7
## 4 yes yes yes 3 2 2 1 1 5 2 15
## 5 no yes no 4 3 2 1 2 5 4 6
## 6 yes yes no 5 4 2 1 2 5 10 15
## G2 G3 alc_use high_use
## 1 6 6 1.0 FALSE
## 2 5 6 1.0 FALSE
## 3 8 10 2.5 TRUE
## 4 14 15 1.0 FALSE
## 5 10 10 1.5 FALSE
## 6 15 15 1.5 FALSE
The variables that will be examined here:
The Hypotheses:
Alcohol consumption will be…
# choosing only the variables of interest
alc_s <- alc %>%
select(studytime,failures,absences,G3, alc_use, high_use)
#transforming data into long format for plotting
alc_sl <- alc_s %>%
gather(var, measure, -high_use)
#Visualising the distributions
alc_sl %>%
ggplot(aes(measure)) + geom_density() +
facet_wrap(~var, scales = "free") + #"free" frees both axes; the earlier extra "free_y" argument was taken as nrow and triggered warnings
theme_bw()
#Looking into means and other summary statistics
summary(alc_s)
## studytime failures absences G3
## Min. :1.000 Min. :0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :0.0000 Median : 3.000 Median :11.00
## Mean :2.034 Mean :0.2906 Mean : 5.319 Mean :10.39
## 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.: 8.000 3rd Qu.:14.00
## Max. :4.000 Max. :3.0000 Max. :75.000 Max. :20.00
## alc_use high_use
## Min. :1.000 Mode :logical
## 1st Qu.:1.000 FALSE:270
## Median :1.500 TRUE :112
## Mean :1.877
## 3rd Qu.:2.500
## Max. :5.000
#the percentage in the high use (and not high use) categories
prop.table(table(alc$high_use))
##
## FALSE TRUE
## 0.7068063 0.2931937
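A toy check of how prop.table() behaves, on made-up values rather than the course data:

```r
# prop.table() divides each cell of a table by the grand total.
x <- rep(c(FALSE, TRUE), times = c(8, 2))  # 8 FALSE, 2 TRUE
tab <- prop.table(table(x))
tab  # FALSE 0.8, TRUE 0.2
```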
#Looking into variable relationships
ggpairs(alc_s,1:5, progress = ggmatrix_progress(clear=T), mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")
# a visual correlation matrix
corrgram(alc_s, order=T, cor.method="spearman")
#examining the predictors separately among students with high vs low alcohol use, red is higher use.
ggpairs(alc_s, 1:4, progress = ggmatrix_progress(clear=T), mapping = aes(col=high_use, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20, size=2.0)), proportions = "auto")
Comments about the variables and the hypotheses
#A logistic regression model for predicting alcohol consumption (high vs low) with the chosen four predictors
m <- glm(high_use ~ failures + absences + studytime + G3, data = alc, family = "binomial")
# printing out the model summary
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + studytime + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5848 -0.8046 -0.6663 1.1027 2.2046
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.43155 0.46417 -0.930 0.352522
## failures 0.34172 0.16495 2.072 0.038296 *
## absences 0.06262 0.01759 3.560 0.000371 ***
## studytime -0.53006 0.15863 -3.341 0.000833 ***
## G3 0.01105 0.02852 0.388 0.698313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 423.47 on 377 degrees of freedom
## AIC: 433.47
##
## Number of Fisher Scoring iterations: 4
#creating coefficients as odds ratios
OR <- coef(m) %>% exp
# computing confidence intervals for the odds ratios
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#combining and printing the ORs and CIs
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.6495044 0.2590351 1.6065887
## failures 1.4073732 1.0190133 1.9541493
## absences 1.0646266 1.0308109 1.1041295
## studytime 0.5885684 0.4271793 0.7966834
## G3 1.0111158 0.9568568 1.0703968
#testing: would G3 be significant as the only predictor
m2 <- glm(high_use ~ G3, data = alc, family = "binomial")
summary(m2) #no, not significant
##
## Call:
## glm(formula = high_use ~ G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8933 -0.8349 -0.8122 1.5495 1.6227
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.71272 0.26843 -2.655 0.00793 **
## G3 -0.01621 0.02379 -0.681 0.49572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 461.75 on 380 degrees of freedom
## AIC: 465.75
##
## Number of Fisher Scoring iterations: 4
The summary shows that three of the predictors significantly predict low vs. high alcohol use. Compared to students with low alcohol use, lower study time (p < .001), more absences (p < .001), and more class failures (p < .05) are typical of those with high alcohol use. Final grade did not significantly differ between the alcohol use groups. Except for the grade variable, the relations are as stated in the hypotheses.
The odds ratios (OR) and their confidence intervals show that, with 95% confidence, the odds ratio for…
In addition, G3 had an odds ratio of 1.01 and its CI extended both above and below 1. This matches the fact that it does not significantly predict the alcohol use category: both groups are about equally likely (OR ≈ 1) to get specific grades. Note that although the OR of absences is also close to 1 (OR = 1.06), its CI stays above 1; the OR is per single absence, so each additional absence still significantly increases the odds of high use.
Overall, OR < 1 means that the examined feature is less likely in the “high” group, and OR > 1 mean that the feature is more likely in the “high” group.
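The OR mechanics can be sketched with a toy logistic model on the built-in mtcars data (am ~ wt is a stand-in, not the course model). Exponentiating a coefficient turns a log-odds effect into an odds ratio; Wald-type intervals via confint.default() are used here to keep the sketch self-contained:

```r
# Toy logistic model: transmission type (am) predicted by car weight (wt).
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
OR_toy <- exp(coef(m_toy))             # coefficients as odds ratios
CI_toy <- exp(confint.default(m_toy))  # Wald CIs on the odds ratio scale
cbind(OR_toy, CI_toy)
# Heavier cars are less likely to have a manual transmission, so OR for wt < 1.
```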
#choosing the statistically significant variables from the first model, and making a new model with them
m_f <- glm(high_use ~ failures + absences + studytime, data = alc, family = "binomial")
#Exploring the predictive power of this modified model
# predicting the probability of high_use
probabilities <- predict(m_f, type = "response")
# adding the predicted probabilities to the data
alc <- mutate(alc, probability = probabilities)
#predicting high_use with the probabilities
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, studytime, high_use, probability, prediction) %>% tail(10) #two misclassifications can be seen with these 10 observations (prediction based on model probabilities vs. actual high_use)
## failures absences studytime high_use probability prediction
## 373 1 0 1 FALSE 0.3714011 FALSE
## 374 1 14 1 TRUE 0.5879619 TRUE
## 375 0 2 3 FALSE 0.1450987 FALSE
## 376 0 7 1 FALSE 0.4011299 FALSE
## 377 1 0 3 FALSE 0.1702142 FALSE
## 378 0 0 2 FALSE 0.2025250 FALSE
## 379 1 0 2 FALSE 0.2582354 FALSE
## 380 1 0 2 FALSE 0.2582354 FALSE
## 381 0 3 1 TRUE 0.3423836 FALSE
## 382 0 0 1 TRUE 0.3011900 FALSE
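The same predict-then-threshold step, sketched on the toy mtcars model (am ~ wt again standing in for the course model):

```r
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
p <- predict(m_toy, type = "response")  # fitted probabilities in [0, 1]
pred <- p > 0.5                         # classify at the 0.5 threshold
table(actual = mtcars$am, predicted = pred)
```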
# Creating a 2x2 cross tabulation of the target variable, i.e. actual case of high alcohol use, versus the predicted high vs. low use.
table(high_use = alc$high_use, prediction = alc$prediction) #This gives numbers
## prediction
## high_use FALSE TRUE
## FALSE 256 14
## TRUE 89 23
# The same tabulation but in percentages, and with error sums
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03664921 0.70680628
## TRUE 0.23298429 0.06020942 0.29319372
## Sum 0.90314136 0.09685864 1.00000000
# Making a plot to visualise prediction accuracy; points are coloured by the predicted class, so misclassifications show up as TRUE-coloured points on the FALSE row and FALSE-coloured points on the TRUE row
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction)) + geom_point()
g
#computing the average number of incorrect predictions.
# 1) a loss function i.e. define mean prediction error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# 2)computing incorrectly classified cases for our model
loss_func(class = alc$high_use, prob = alc$probability) #27% of cases incorrectly classified. This can also be obtained by summing the misclassified shares from the cross-tabulation (0.23 + 0.037)
## [1] 0.2696335
Based on the cross-tabulations of actual and predicted alcohol use class, and the value obtained from the loss function, the model has an accuracy rate of (100 - 27) = 73%; overall, 27% of cases are classified into the wrong consumption class. Random guessing between the two classes would give about 50% accuracy, and even always predicting "low use" would reach about 71% (the share of low users), so the model is only a modest improvement.
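The loss function's behaviour can be checked on a small made-up example (redefined here so the snippet is self-contained):

```r
# Mean prediction error: share of cases where the probability is on the
# wrong side of 0.5 relative to the actual class.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
actual <- c(TRUE, FALSE, TRUE, FALSE)   # made-up classes
probs  <- c(0.9, 0.2, 0.3, 0.7)         # last two are on the wrong side of 0.5
loss_func(class = actual, prob = probs)  # 0.5, i.e. half misclassified
```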
# Performing a 10-fold cross-validation on the modified model
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m_f, K = 10)
# average number of wrong predictions, i.e. prediction error (The cv.glm function saves the error in delta)
cv$delta[1]
## [1] 0.2827225
The model has a prediction error of 0.28, slightly worse than the 0.26 obtained in DataCamp.
Could I find a better model? Perhaps, by looking at all variable correlations with the alcohol use variable and choosing the strongest ones. But how much better such a model would be shall remain a mystery for now.
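For reference, the cross-validation step can be sketched on the toy mtcars model (a stand-in for the course model, with a fixed seed since the fold assignment is random):

```r
library(boot)
set.seed(1)  # K-fold assignment is random
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
cv_toy <- cv.glm(data = mtcars, cost = loss_func, glmfit = m_toy, K = 5)
cv_toy$delta[1]  # cross-validated misclassification rate
```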
#loading the MASS package
library(MASS)
#loading the data from it
data("Boston")
dim(Boston)#506 obs, 14 var.
## [1] 506 14
str(Boston) # all values are numerical, 3 have integer values and the rest have decimals.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston,10)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## 7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60
## 8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90
## 9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63
## 10 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
## 7 12.43 22.9
## 8 19.15 27.1
## 9 29.93 16.5
## 10 17.10 18.9
The dataset contains information related to housing in the suburbs of Boston. There are 506 rows and 14 columns (variables) in the data. A few examples of variable measures are:
More information about the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Let’s take a closer look at the variables
#loading packages
library(tidyverse)
library(corrplot)
library(GGally)
#a graphical overview of the data
#let's run pairs() on parts of the data at a time to get a clearer picture
pairs(Boston[1:7])
pairs(Boston[7:14])
#looking at distributions and correlations
ggpairs(Boston,lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 1.5)))
# calculating the correlation matrix and rounding it
cor_matrix <- cor(Boston) %>% round(2)
#looking into the correlations
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)
#show summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#two additional histograms
hist(Boston$black)
hist(Boston$lstat)
#centering and standardizing the variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled) #matrix
## [1] "matrix" "array"
# transforming it into a data frame
boston_scaled <- as.data.frame(boston_scaled)
In the standardized (and centered) data, all variable means are now zero. The min and max values also changed (the min is now negative), and the range became smaller for many of the variables due to the applied transformation.
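What scale() does can be verified by hand on a toy vector:

```r
x <- c(2, 4, 6, 8)
scaled <- as.vector(scale(x))      # scale() subtracts the mean, divides by the sd
manual <- (x - mean(x)) / sd(x)
all.equal(scaled, manual)          # the results match: mean 0, sd 1
```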
#creating a quantile vector of crim and printing it
q <- quantile(boston_scaled$crim)
q
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#Using the quantiles as the break points in the new categorical variable
crime <- cut(boston_scaled$crim, breaks = q, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))
#looking into the new factor crime variable
table(crime) #seems fine
## crime
## low med_low med_high high
## 127 126 126 127
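Using the quantiles as break points is what makes the four classes roughly equal in size. A self-contained toy check with cut() (the vector 1:100 is illustrative):

```r
# quantile break points split the data into four roughly equal classes
v <- 1:100                          # toy data
qs <- quantile(v)                   # 0%, 25%, 50%, 75%, 100%
f <- cut(v, breaks = qs, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)                            # 25 observations in each class
```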
#adding the crime variable to the scaled data
boston_scaled <- data.frame(boston_scaled, crime)
#dropping the original crim variable from the data
boston_scaled <- dplyr::select(boston_scaled, -crim)
#separating the data so that 80% of the data belongs to the train set (and 20% to the test set)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# randomly choosing 80% of the rows
training <- sample(n, size = n * 0.8)
# create the train set
train <- boston_scaled[training,]
# create the test set
test <- boston_scaled[-training,]
dim(test) #102 obs
## [1] 102 14
dim(train) #404 obs
## [1] 404 14
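Because no seed was set before sample(), the split (and everything downstream) changes at each knit. A reproducible sketch of the same 80/20 split (the seed value 2020 is arbitrary):

```r
# fixing the seed makes sample() return the same rows at every knit
set.seed(2020)                      # arbitrary but fixed seed
n <- 506                            # number of rows in the Boston data
idx <- sample(n, size = floor(n * 0.8))
c(train = length(idx), test = n - length(idx))  # 404 and 102
```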
#Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
#fitting the LDA (the lda() function comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)
#printing the model object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2673267 0.2500000 0.2351485
##
## Group means:
## zn indus chas nox rm age
## low 0.8889015 -0.8999943 -0.11484506 -0.8553421 0.43899528 -0.8353411
## med_low -0.1074996 -0.2564144 0.01930799 -0.5333024 -0.12326288 -0.2640359
## med_high -0.3666748 0.2460621 0.19544522 0.4630010 0.09217204 0.4367287
## high -0.4872402 1.0172655 -0.06511327 1.0464728 -0.39055884 0.8180510
## dis rad tax ptratio black lstat
## low 0.8178003 -0.6913090 -0.7471776 -0.45337566 0.37791357 -0.76791990
## med_low 0.3247290 -0.5490693 -0.4728386 -0.05216806 0.32189395 -0.10765962
## med_high -0.4102739 -0.4087748 -0.2798662 -0.40066356 0.06105261 0.01991081
## high -0.8758633 1.6366336 1.5129868 0.77903654 -0.91509073 0.96165711
## medv
## low 0.52812207
## med_low -0.01101704
## med_high 0.17459926
## high -0.69360117
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11406300 0.696616746 -0.919981153
## indus 0.00203339 -0.315921652 0.328982955
## chas -0.06769767 -0.093041650 0.121228137
## nox 0.38698109 -0.753801869 -1.344513739
## rm -0.10256818 -0.140186623 -0.152909822
## age 0.27167693 -0.277850835 -0.001678226
## dis -0.05146156 -0.306385485 0.204890750
## rad 3.23339274 0.854436047 -0.037029630
## tax -0.01213106 -0.004091378 0.389529523
## ptratio 0.09327539 0.123653064 -0.218696848
## black -0.13914142 -0.029569191 0.110188548
## lstat 0.26489617 -0.242049403 0.374161405
## medv 0.22931903 -0.351471932 -0.205967372
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9482 0.0384 0.0134
# target classes as numeric
target <- as.numeric(train$crime)
#Draw the LDA (bi)plot
plot(lda.fit, dimen = 2, col=target, pch=target)
#Save the (correct) crime categories from the test set
correct_crime <- test$crime
# remove the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
#
#Predict the classes with the LDA model on the test data.
lda.pred <- predict(lda.fit, newdata = test)
#
#Cross tabulate the results with the crime categories from the test set.
table(correct = correct_crime, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 7 1 0
## med_low 3 14 1 0
## med_high 1 13 10 1
## high 0 0 0 32
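Overall prediction accuracy is the sum of the diagonal of the confusion table divided by its total. A sketch with the counts printed above hard-coded so the example is self-contained (the exact counts vary with the random split):

```r
# confusion matrix from the cross-tabulation above (rows = correct class)
confusion <- matrix(c(19,  7,  1,  0,
                       3, 14,  1,  0,
                       1, 13, 10,  1,
                       0,  0,  0, 32),
                    nrow = 4, byrow = TRUE)
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 2)  # about 0.74 for these counts
```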
Note that the exact figures below come from the run before knitting; because the train/test split is random (no seed was set), they shift slightly every time the document is knitted.
The LDA predicted all high crime rate cases correctly, and also most of the med_high, med_low and low instances. Still, a notable proportion (over 40% in the pre-knit run) of the med_high and low cases were misclassified. The plot also shows visually that the high crime cases are clearly separated from most of the other classes.
The lda.fit object further showed that the first discriminant explained about 95% of the between-group variance captured by the model, the second about 3.6% and the third about 1.2% (pre-knit values; the knitted output above gives 94.8%, 3.8% and 1.3%). On the first dimension, rad (accessibility to radial highways) separated the crime rate classes the most, and the group means show that it is highest in the high crime rate areas.
#Reload the Boston dataset
data("Boston")
#standardize the dataset (to get comparable distances below).
#saving the scaled data as a data frame
boston_scaled2 <- scale(Boston) %>% as.data.frame()
#looking into last 10 observations
tail(boston_scaled2,10)
## crim zn indus chas nox rm
## 497 -0.3864333 -0.4872402 -0.2108898 -0.2723291 0.2615253 -1.2732886
## 498 -0.3889003 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.6982955
## 499 -0.3923020 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.3780642
## 500 -0.3994275 -0.4872402 -0.2108898 -0.2723291 0.2615253 -1.0185268
## 501 -0.3940157 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.3666782
## 502 -0.4128204 -0.4872402 0.1156240 -0.2723291 0.1579678 0.4388814
## 503 -0.4148387 -0.4872402 0.1156240 -0.2723291 0.1579678 -0.2343159
## 504 -0.4130378 -0.4872402 0.1156240 -0.2723291 0.1579678 0.9839863
## 505 -0.4073609 -0.4872402 0.1156240 -0.2723291 0.1579678 0.7249547
## 506 -0.4145899 -0.4872402 0.1156240 -0.2723291 0.1579678 -0.3624084
## age dis rad tax ptratio black lstat
## 497 0.15365093 -0.4732098 -0.4076377 -0.1022751 0.343873 0.4406159 1.18846991
## 498 0.07194248 -0.4285218 -0.4076377 -0.1022751 0.343873 0.4406159 0.20262208
## 499 -0.11634223 -0.6581830 -0.4076377 -0.1022751 0.343873 0.4406159 0.03738054
## 500 0.17496618 -0.6625521 -0.4076377 -0.1022751 0.343873 0.4282384 0.34265729
## 501 0.39522376 -0.6158695 -0.4076377 -0.1022751 0.343873 0.4406159 0.23483018
## 502 0.01865435 -0.6251775 -0.9818712 -0.8024176 1.175303 0.3868341 -0.41773387
## 503 0.28864751 -0.7159308 -0.9818712 -0.8024176 1.175303 0.4406159 -0.50035464
## 504 0.79666096 -0.7729187 -0.9818712 -0.8024176 1.175303 0.4406159 -0.98207574
## 505 0.73626775 -0.6677760 -0.9818712 -0.8024176 1.175303 0.4028263 -0.86444617
## 506 0.43430172 -0.6126402 -0.9818712 -0.8024176 1.175303 0.4406159 -0.66839688
## medv
## 497 -0.30801068
## 498 -0.46023251
## 499 -0.14491587
## 500 -0.54721641
## 501 -0.62332733
## 502 -0.01444002
## 503 -0.21015379
## 504 0.14865480
## 505 -0.05793197
## 506 -1.15610373
#and a summary
summary(boston_scaled2)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#Calculate the distances between the observations.
#First, Euclidean distance:
#matrix
dist_eu <- dist(boston_scaled2) #euclidean distance is the default
#summary
summary(dist_eu) # median distance is 4.8 (out of 14.4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Second, Manhattan distance:
# matrix
dist_man <- dist(boston_scaled2, method="manhattan")
#summary
summary(dist_man) # median distance is 12.6 (out of 48.9)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
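The difference between the two metrics is easy to see on two toy points a = (0, 0) and b = (3, 4) (illustrative values, not from the data):

```r
# Euclidean vs. Manhattan distance between two toy points
p <- rbind(a = c(0, 0), b = c(3, 4))
dist(p)                            # euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")      # |3 - 0| + |4 - 0| = 7
```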
#Run k-means algorithm on the dataset.
km <- kmeans(boston_scaled2, centers = 4) #centers is the number of clusters, given as a number
#Investigate what is the optimal number of clusters and run the algorithm again.
#looking at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes; the optimal number of clusters is when the total WCSS drops radically.
#setting a seed so that the random initial cluster centers are reproducible
set.seed(123)
#Let's check how the WCSS change through 10 different cluster amounts (1 to 10 clusters).
k_max <- 10 #the max number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss}) #calculating the WCSS
#examining the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
#The biggest drop comes at cluster change from 1 to 2 so 2 clusters seems to be the best choice.
#Let's rerun the k-means clustering
km_f <- kmeans(boston_scaled2, centers = 2)
#plotting (visualising) the clusters (in two subsets to see the results clearer)
pairs(boston_scaled2[1:7], col =km_f$cluster)
pairs(boston_scaled2[8:14], col =km_f$cluster)
#transforming the cluster assignment into a factor for the visualization
km_f$cluster <- as.factor(km_f$cluster)
#some further visualization with the two subsets
ggpairs(boston_scaled2[1:7],lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 1.5)), mapping=ggplot2::aes(colour = km_f$cluster, alpha=0.5))
ggpairs(boston_scaled2[8:14],lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 1.5)), mapping=ggplot2::aes(colour = km_f$cluster, alpha=0.5))
The plots indicate that most of the variables are distributed differently across the two clusters, although the plots alone cannot tell which differences would be statistically significant. At least crim (crime rate), indus (proportion of non-retail business acres), nox (nitrogen oxides concentration), lstat (share of lower-status population), tax (property tax rate), rad (accessibility to radial highways), dis (distance to employment centres), and age appear to separate the clusters. Overall, the clusters seem to partially differentiate between more and less affluent neighborhoods.
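Cluster-wise variable means make this separation more concrete than the pairwise plots. A sketch that re-runs the scaling and clustering so it is self-contained (assumes the MASS package, which ships with R, is available):

```r
# compare the two clusters by their per-variable means
library(MASS)                       # for the Boston data
bs <- as.data.frame(scale(Boston))
set.seed(123)                       # reproducible cluster centers
km2 <- kmeans(bs, centers = 2)
round(aggregate(bs, by = list(cluster = km2$cluster), FUN = mean), 2)
```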
Bonus:
#Performing the cluster analysis
km3 <- kmeans(boston_scaled2, centers = 3)
#making the target variable numeric
clusters <- as.numeric(km3$cluster)
#adding the target variable into the data
boston_scaled2 <- data.frame(boston_scaled2, clusters)
#OBS! I had to comment out the code below because the lda() function gave an error when knitting ("variable 4 appears to be constant within groups"). I could not find a way to fix it, so the text at the end is based on the results I got before knitting the document.
#Including all the variables in the Boston data as predictors in the LDA
#lda.fit2 <- lda(clusters ~ ., data = boston_scaled2)
#printing the LDA object
#lda.fit2
#Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
#creating the plot
#plot(lda.fit2, dimen = 2, col=clusters, pch=clusters)
#lda.arrows(lda.fit2, myscale = 4.5)
Based on the LDA object and the plot, the most influential linear separators of the clusters are nox, tax, zn, and medv. Some other variables, such as age, also help to separate the clusters. Still, because the first discriminant function is the most important one (explaining 85% of the between-cluster variance captured by the model), the variables that load on it most strongly separate the clusters best, i.e. nox (nitrogen oxides concentration) and tax (property tax rate). Looking at the group means, group 1 seems to contain the “worst” neighborhoods, group 2 the “best”, and group 3 something in between.
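The knitting error mentioned above ("appears to be constant within groups") arises when some predictor has (near-)zero variance inside a group. A toy sketch for spotting such columns before calling lda() (the data frame d and columns grp, x1, x2 are illustrative names, not from the course data):

```r
# find predictors that are constant within some group
d <- data.frame(grp = rep(1:2, each = 5),
                x1  = c(1, 3, 2, 5, 4, 2, 6, 4, 3, 5),
                x2  = rep(c(1, 2), each = 5))   # constant within each group
within_sd <- sapply(d[c("x1", "x2")],
                    function(v) min(tapply(v, d$grp, sd)))
within_sd                                       # x2 has sd 0 -> drop it first
```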
#Loading in the data from my local data folder, row.names=1 keeps the country names as row names.
setwd("./data")
human_ <- read.csv("humanf.csv", row.names=1)
head(human_,4)
## edu2_ftom labour_ftom Edu_exp life_exp GNI mat_mor Adol_birth
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4 7.8
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6 12.1
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6 1.9
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5 5.1
## parl_f
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
dim(human_) # seems good
## [1] 155 8
#loading libraries
library(GGally)
library(ggplot2)
library(corrplot)
library(tidyverse)
#variable summaries
summary(human_)
## edu2_ftom labour_ftom Edu_exp life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI mat_mor Adol_birth parl_f
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
#a graphical overview of the data
ggpairs(human_,lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 3)))
#looking more into the correlations
# calculating the correlation matrix and rounding it
cor_matrix<-cor(human_) %>% round(2)
#looking into the correlations
cor_matrix
## edu2_ftom labour_ftom Edu_exp life_exp GNI mat_mor Adol_birth
## edu2_ftom 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53
## labour_ftom 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12
## Edu_exp 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70
## life_exp 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73
## GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56
## mat_mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76
## Adol_birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00
## parl_f 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07
## parl_f
## edu2_ftom 0.08
## labour_ftom 0.25
## Edu_exp 0.21
## life_exp 0.17
## GNI 0.09
## mat_mor -0.09
## Adol_birth -0.07
## parl_f 1.00
# visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)
Descriptions of the variables:
Some observations about the data:
#PCA (with the SVD method)
pca_human <- prcomp(human_)
#printing the summary object and a summary of it
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## edu2_ftom -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## labour_ftom 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## Edu_exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## life_exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## mat_mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## Adol_birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## parl_f -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## edu2_ftom -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## labour_ftom 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu_exp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## life_exp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## mat_mor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## Adol_birth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## parl_f -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
summary(pca_human) #this shows clearly the proportion of variance captured by each component: the first PC alone accounts for 99.9% of the captured variance, which mostly reflects the huge raw variance of GNI in the unstandardized data. The standard deviations are also shown.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
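The proportions reported by summary() come directly from the component standard deviations: each component's variance divided by the total variance. A self-contained check using the built-in USArrests data (any numeric data would do):

```r
# proportion of variance per component, computed by hand from sdev
pr <- prcomp(USArrests)
vars <- pr$sdev^2
round(vars / sum(vars), 3)    # matches summary(pr)$importance[2, ]
```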
#Drawing a biplot
biplot(pca_human, choices = 1:2, cex= c(1,1), col=c("orange", "navy"))
Interpreting and commenting on the results:
#Let's draw the plot again but name the PC1 as "Low standard of living", see: http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf
biplot(pca_human, choices = 1:2, cex= c(1,1), col=c("orange", "navy"), xlab = "Low standard of living")
#standardizing the data
human_std <- scale(human_)
summary(human_std) #seems good.
## edu2_ftom labour_ftom Edu_exp life_exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI mat_mor Adol_birth parl_f
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
#PCA (with the SVD method)
pca_humanstd <- prcomp(human_std)
pca_humanstd
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## edu2_ftom -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## labour_ftom 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu_exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## life_exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## mat_mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Adol_birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## parl_f -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## edu2_ftom 0.17713316 0.05773644 0.16459453
## labour_ftom -0.03500707 -0.22729927 -0.07304568
## Edu_exp -0.38606919 0.77962966 -0.05415984
## life_exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## mat_mor 0.17370039 0.35380306 0.72193946
## Adol_birth -0.76056557 -0.06897064 -0.14335186
## parl_f 0.13749772 0.00568387 -0.02306476
summary(pca_humanstd) # a more reasonable/useful finding with the first PC explaining 54% of the variance, 2nd PC explaining 16% and third explaining 10% etc.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
#Drawing the biplot
biplot(pca_humanstd, choices = 1:2, cex= c(0.8,0.9), col=c("lightpink", "darkblue"))
Interpreting and commenting on the results:
#Let's draw the plot again, naming PC1 "Low standard of living and education" and PC2 "Equality".
biplot(pca_humanstd, choices = 1:2, cex= c(1,1), col=c("orange", "navy"), xlab = "Low standard of living and education", ylab= "Equality")
#Load the tea dataset from the package Factominer.
library(FactoMineR)
data("tea")
dim(tea) #300 obs, 36 var.
## [1] 300 36
head(tea,6) #the values are text labels; str() below confirms the variables are factors
## breakfast tea.time evening lunch dinner always home
## 1 breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 2 breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 3 Not.breakfast tea time evening Not.lunch dinner Not.always home
## 4 Not.breakfast Not.tea time Not.evening Not.lunch dinner Not.always home
## 5 breakfast Not.tea time evening Not.lunch Not.dinner always home
## 6 Not.breakfast Not.tea time Not.evening Not.lunch dinner Not.always home
## work tearoom friends resto pub Tea How sugar
## 1 Not.work Not.tearoom Not.friends Not.resto Not.pub black alone sugar
## 2 Not.work Not.tearoom Not.friends Not.resto Not.pub black milk No.sugar
## 3 work Not.tearoom friends resto Not.pub Earl Grey alone No.sugar
## 4 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone sugar
## 5 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## 6 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## how where price age sex SPC Sport age_Q
## 1 tea bag chain store p_unknown 39 M middle sportsman 35-44
## 2 tea bag chain store p_variable 45 F middle sportsman 45-59
## 3 tea bag chain store p_variable 47 F other worker sportsman 45-59
## 4 tea bag chain store p_variable 23 M student Not.sportsman 15-24
## 5 tea bag chain store p_variable 48 M employee sportsman 45-59
## 6 tea bag chain store p_private label 21 M student sportsman 15-24
## frequency escape.exoticism spirituality healthy diuretic
## 1 1/day Not.escape-exoticism Not.spirituality healthy Not.diuretic
## 2 1/day escape-exoticism Not.spirituality healthy diuretic
## 3 +2/day Not.escape-exoticism Not.spirituality healthy diuretic
## 4 1/day escape-exoticism spirituality healthy Not.diuretic
## 5 +2/day escape-exoticism spirituality Not.healthy diuretic
## 6 1/day Not.escape-exoticism Not.spirituality healthy Not.diuretic
## friendliness iron.absorption feminine sophisticated
## 1 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 2 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 3 friendliness Not.iron absorption Not.feminine Not.sophisticated
## 4 Not.friendliness Not.iron absorption Not.feminine sophisticated
## 5 friendliness Not.iron absorption Not.feminine Not.sophisticated
## 6 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## slimming exciting relaxing effect.on.health
## 1 No.slimming No.exciting No.relaxing No.effect on health
## 2 No.slimming exciting No.relaxing No.effect on health
## 3 No.slimming No.exciting relaxing No.effect on health
## 4 No.slimming No.exciting relaxing No.effect on health
## 5 No.slimming No.exciting relaxing No.effect on health
## 6 No.slimming No.exciting relaxing No.effect on health
str(tea) # variables are of type factor and most of them have two levels. The data measures matters related to tea consumption.
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#visualizing the data with bar plots
#first, selecting parts of the data
tea1 <- dplyr::select(tea, 1:13)
tea2 <- dplyr::select(tea, 14:25)
tea3 <- dplyr::select(tea, 26:36)
#drawing the plots
gather(tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
gather(tea3) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(tea3,4)
## spirituality healthy diuretic friendliness iron.absorption
## 1 Not.spirituality healthy Not.diuretic Not.friendliness Not.iron absorption
## 2 Not.spirituality healthy diuretic Not.friendliness Not.iron absorption
## 3 Not.spirituality healthy diuretic friendliness Not.iron absorption
## 4 spirituality healthy Not.diuretic Not.friendliness Not.iron absorption
## feminine sophisticated slimming exciting relaxing
## 1 Not.feminine Not.sophisticated No.slimming No.exciting No.relaxing
## 2 Not.feminine Not.sophisticated No.slimming exciting No.relaxing
## 3 Not.feminine Not.sophisticated No.slimming No.exciting relaxing
## 4 Not.feminine sophisticated No.slimming No.exciting relaxing
## effect.on.health
## 1 No.effect on health
## 2 No.effect on health
## 3 No.effect on health
## 4 No.effect on health
#MCA
mca <- MCA(tea3, graph = TRUE)
summary(mca) #the dimensions explain variance so that the 1st dimension explains 15%, the 2nd 13%, the third and fourth both 11% of variance. Altogether, the MCA produces 11 dimensions.
##
## Call:
## MCA(X = tea3, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.153 0.128 0.111 0.105 0.095 0.085 0.077
## % of var. 15.296 12.806 11.108 10.516 9.499 8.511 7.694
## Cumulative % of var. 15.296 28.102 39.210 49.726 59.225 67.736 75.430
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.070 0.069 0.057 0.050
## % of var. 7.049 6.851 5.714 4.955
## Cumulative % of var. 82.480 89.331 95.045 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.659 0.947 0.380 | -0.343 0.307 0.103 | 0.573
## 2 | 0.368 0.294 0.115 | 0.088 0.020 0.007 | 0.903
## 3 | 0.137 0.041 0.030 | -0.446 0.518 0.315 | 0.297
## 4 | 0.173 0.066 0.030 | -0.545 0.773 0.293 | -0.161
## 5 | 0.244 0.130 0.062 | -0.207 0.112 0.045 | -0.119
## 6 | 0.568 0.702 0.307 | -0.652 1.108 0.406 | 0.312
## 7 | 0.568 0.702 0.307 | -0.652 1.108 0.406 | 0.312
## 8 | -0.127 0.035 0.029 | -0.378 0.372 0.261 | -0.338
## 9 | 0.610 0.810 0.362 | -0.299 0.233 0.087 | -0.215
## 10 | 0.137 0.041 0.030 | -0.446 0.518 0.315 | 0.297
## ctr cos2
## 1 0.987 0.288 |
## 2 2.449 0.697 |
## 3 0.264 0.139 |
## 4 0.078 0.026 |
## 5 0.043 0.015 |
## 6 0.292 0.093 |
## 7 0.292 0.093 |
## 8 0.343 0.208 |
## 9 0.139 0.045 |
## 10 0.264 0.139 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## Not.spirituality | 0.222 2.004 0.108 5.673 | 0.004 0.001 0.000 0.109
## spirituality | -0.486 4.393 0.108 -5.673 | -0.009 0.002 0.000 -0.109
## healthy | -0.351 5.115 0.287 -9.262 | -0.286 4.074 0.191 -7.563
## Not.healthy | 0.818 11.936 0.287 9.262 | 0.668 9.507 0.191 7.563
## diuretic | -0.406 5.687 0.228 -8.253 | 0.210 1.815 0.061 4.266
## Not.diuretic | 0.561 7.853 0.228 8.253 | -0.290 2.506 0.061 -4.266
## friendliness | -0.171 1.407 0.122 -6.051 | 0.060 0.208 0.015 2.129
## Not.friendliness | 0.715 5.870 0.122 6.051 | -0.252 0.868 0.015 -2.129
## iron absorption | -0.759 3.538 0.066 -4.455 | -0.275 0.554 0.009 -1.614
## Not.iron absorption | 0.087 0.408 0.066 4.455 | 0.032 0.064 0.009 1.614
## Dim.3 ctr cos2 v.test
## Not.spirituality | 0.208 2.434 0.095 5.328 |
## spirituality | -0.456 5.335 0.095 -5.328 |
## healthy | 0.258 3.812 0.155 6.813 |
## Not.healthy | -0.602 8.894 0.155 -6.813 |
## diuretic | 0.324 4.977 0.145 6.580 |
## Not.diuretic | -0.447 6.873 0.145 -6.580 |
## friendliness | -0.160 1.687 0.107 -5.647 |
## Not.friendliness | 0.667 7.040 0.107 5.647 |
## iron absorption | 0.580 2.841 0.039 3.402 |
## Not.iron absorption | -0.067 0.327 0.039 -3.402 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## spirituality | 0.108 0.000 0.095 |
## healthy | 0.287 0.191 0.155 |
## diuretic | 0.228 0.061 0.145 |
## friendliness | 0.122 0.015 0.107 |
## iron.absorption | 0.066 0.009 0.039 |
## feminine | 0.304 0.027 0.058 |
## sophisticated | 0.198 0.039 0.233 |
## slimming | 0.236 0.052 0.038 |
## exciting | 0.020 0.340 0.046 |
## relaxing | 0.037 0.347 0.216 |
#Based on the eta squared values, the variables that best capture the underlying dimensions are: feminine and healthy on the first dimension, exciting and relaxing on the second, and relaxing and sophisticated on the third. Thus, people view tea drinking in different ways, and certain views tend to occur together (for instance, seeing tea drinking as both feminine and healthy). These dimensions also set respondents apart, characterizing different kinds of tea drinkers.
#drawing a plot
plot(mca, invisible=c("ind"), habillage = "quali")
#The plot further shows the relations between the different variables (closer together = more related) as well as between the variables and the dimensions.
This chapter follows MABS, i.e., Multivariate Analysis for the Behavioral Sciences by Vehkalahti and Everitt (2019).
#libraries
library(dplyr)
library(tidyr)
library(gtsummary)
library(ggplot2)
#loading the BPRS_l into R
BPRS_l <- read.csv("BPRS_l.csv", row.names=1)
dim(BPRS_l)
## [1] 360 5
str(BPRS_l)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
#transforming treatment and subject to factors (again)
BPRS_l$treatment <- factor(BPRS_l$treatment)
BPRS_l$subject <- factor(BPRS_l$subject)
head(BPRS_l,30)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
## 7 1 7 week0 71 0
## 8 1 8 week0 30 0
## 9 1 9 week0 41 0
## 10 1 10 week0 57 0
## 11 1 11 week0 30 0
## 12 1 12 week0 55 0
## 13 1 13 week0 36 0
## 14 1 14 week0 38 0
## 15 1 15 week0 66 0
## 16 1 16 week0 41 0
## 17 1 17 week0 45 0
## 18 1 18 week0 39 0
## 19 1 19 week0 24 0
## 20 1 20 week0 38 0
## 21 2 1 week0 52 0
## 22 2 2 week0 30 0
## 23 2 3 week0 65 0
## 24 2 4 week0 37 0
## 25 2 5 week0 59 0
## 26 2 6 week0 30 0
## 27 2 7 week0 69 0
## 28 2 8 week0 62 0
## 29 2 9 week0 38 0
## 30 2 10 week0 65 0
tail(BPRS_l,20) #looks good
## treatment subject weeks bprs week
## 341 2 1 week8 50 8
## 342 2 2 week8 20 8
## 343 2 3 week8 32 8
## 344 2 4 week8 23 8
## 345 2 5 week8 35 8
## 346 2 6 week8 21 8
## 347 2 7 week8 35 8
## 348 2 8 week8 66 8
## 349 2 9 week8 21 8
## 350 2 10 week8 39 8
## 351 2 11 week8 75 8
## 352 2 12 week8 23 8
## 353 2 13 week8 28 8
## 354 2 14 week8 27 8
## 355 2 15 week8 37 8
## 356 2 16 week8 22 8
## 357 2 17 week8 43 8
## 358 2 18 week8 30 8
## 359 2 19 week8 21 8
## 360 2 20 week8 23 8
#loading the RATS_l into R
RATS_l <- read.csv("RATS_l.csv", row.names=1)
dim(RATS_l)
## [1] 176 5
glimpse(RATS_l)
## Rows: 176
## Columns: 5
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
#transforming the ID and Group variables to factors:
RATS_l$ID <- factor(RATS_l$ID)
RATS_l$Group <- factor(RATS_l$Group)
head(RATS_l,30)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
## 7 7 1 WD1 275 1
## 8 8 1 WD1 245 1
## 9 9 2 WD1 410 1
## 10 10 2 WD1 405 1
## 11 11 2 WD1 445 1
## 12 12 2 WD1 555 1
## 13 13 3 WD1 470 1
## 14 14 3 WD1 535 1
## 15 15 3 WD1 520 1
## 16 16 3 WD1 510 1
## 17 1 1 WD8 250 8
## 18 2 1 WD8 230 8
## 19 3 1 WD8 250 8
## 20 4 1 WD8 255 8
## 21 5 1 WD8 260 8
## 22 6 1 WD8 265 8
## 23 7 1 WD8 275 8
## 24 8 1 WD8 255 8
## 25 9 2 WD8 415 8
## 26 10 2 WD8 420 8
## 27 11 2 WD8 445 8
## 28 12 2 WD8 560 8
## 29 13 3 WD8 465 8
## 30 14 3 WD8 525 8
tail(RATS_l,20) #looks good
## ID Group WD Weight Time
## 157 13 3 WD57 518 57
## 158 14 3 WD57 544 57
## 159 15 3 WD57 555 57
## 160 16 3 WD57 553 57
## 161 1 1 WD64 278 64
## 162 2 1 WD64 245 64
## 163 3 1 WD64 269 64
## 164 4 1 WD64 275 64
## 165 5 1 WD64 280 64
## 166 6 1 WD64 281 64
## 167 7 1 WD64 284 64
## 168 8 1 WD64 278 64
## 169 9 2 WD64 478 64
## 170 10 2 WD64 496 64
## 171 11 2 WD64 472 64
## 172 12 2 WD64 628 64
## 173 13 3 WD64 525 64
## 174 14 3 WD64 559 64
## 175 15 3 WD64 548 64
## 176 16 3 WD64 569 64
The book describes the RATS data as coming from "a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ."
# A summary table of the RATS data:
#Print out a summary table of all variables
my_table <- tbl_summary(RATS_l,
digits = all_continuous() ~ 2,
type = all_continuous() ~ "continuous2",
statistic = list(all_continuous() ~ c("{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}")))
#Add some missing elements to the table and print it out
my_table %>% bold_labels()
| Characteristic | N = 176 |
|---|---|
| ID | |
| 1 | 11 (6.2%) |
| 2 | 11 (6.2%) |
| 3 | 11 (6.2%) |
| 4 | 11 (6.2%) |
| 5 | 11 (6.2%) |
| 6 | 11 (6.2%) |
| 7 | 11 (6.2%) |
| 8 | 11 (6.2%) |
| 9 | 11 (6.2%) |
| 10 | 11 (6.2%) |
| 11 | 11 (6.2%) |
| 12 | 11 (6.2%) |
| 13 | 11 (6.2%) |
| 14 | 11 (6.2%) |
| 15 | 11 (6.2%) |
| 16 | 11 (6.2%) |
| Group | |
| 1 | 88 (50%) |
| 2 | 44 (25%) |
| 3 | 44 (25%) |
| WD | |
| WD1 | 16 (9.1%) |
| WD15 | 16 (9.1%) |
| WD22 | 16 (9.1%) |
| WD29 | 16 (9.1%) |
| WD36 | 16 (9.1%) |
| WD43 | 16 (9.1%) |
| WD44 | 16 (9.1%) |
| WD50 | 16 (9.1%) |
| WD57 | 16 (9.1%) |
| WD64 | 16 (9.1%) |
| WD8 | 16 (9.1%) |
| Weight | |
| Mean (SD) | 384.48 (127.16) |
| Median (IQR) | 344.50 (267.00, 511.25) |
| Range | 225.00, 628.00 |
| Time | |
| Mean (SD) | 33.55 (19.51) |
| Median (IQR) | 36.00 (15.00, 50.00) |
| Range | 1.00, 64.00 |
#other data examinations are done above (and in the separate R script).
#First, let's plot the weight values of all 16 rats, differentiating between the three groups.
ggplot(RATS_l, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "Weights in grams", limits = c(min(RATS_l$Weight), max(RATS_l$Weight))) + theme_bw()
# Next, let's make the same plot but with standardized weight values.
#standardizing the weight measures, i.e., subtracting the relevant occasion mean from each observation and dividing by the corresponding standard deviation.
RATS_ls <- RATS_l %>%
group_by(Time) %>%
mutate(std_Weight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
glimpse(RATS_ls) #looks ok.
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,...
## $ std_Weight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.88190...
# Plotting the standardized weights per group
ggplot(RATS_ls, aes(x = Time, y = std_Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weights") + theme_bw()
#
#First, let's create a graph with group means of the rats' weights with standard error of the mean (se) indicating the variation at each measurement point.
#
#to calculate se we need an n; here we use the number of measurement occasions
#(note: the standard error of a group mean would more conventionally use the number of rats per group)
n <- RATS_l$Time %>% unique() %>% length()
n #11 measurement occasions, as expected.
## [1] 11
# Creating a summary data with mean and se of weight by group and measurement time
RATS_s <- RATS_l %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
glimpse(RATS_s) #mean and se of weight values in each group for each measurement time.
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
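As an aside (my own variation, not the course code): the standard error of a group mean at each time point is arguably more conventionally computed with the number of rats contributing to that mean (8, 4, and 4 per group), which dplyr's n() gives directly, rather than the number of measurement occasions.

```r
#Alternative sketch: se based on the group size at each time point,
#using dplyr's n() inside summarise().
RATS_s_alt <- RATS_l %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n())) %>%
  ungroup()
```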
#Let's plot these mean weight profiles
ggplot(RATS_s, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)") + scale_x_continuous(name="time(days)") + theme_classic()
# Let's visualize the same group means also with a bar plot (with error bars).
ggplot(RATS_l, aes(x=Time, y=Weight,colour=Group, fill=Group)) +
stat_summary(fun = mean, geom = "bar", position="dodge", width=3) +
stat_summary(fun.data = mean_se, geom = "errorbar", position="dodge", width=3) + theme_bw()
The plots with error bars (standard error of the means) confirm that group 1 differs from the other groups, and show that groups 2 and 3 differ from each other at first but then converge, with the error bars nearly touching (or actually overlapping slightly!) at the end of the measurement period. Thus, based on the plots, these two groups may differ from each other, at least in overall means. However, the difference is not large and seems to mostly disappear by the end.
Since the question of most interest is “whether the growth profiles of the three groups differ”, it would perhaps be most informative to examine a summary statistic where each group’s first mean weight is subtracted from its final mean weight. Alternatively, we could examine with regression whether the rate of change in body weight differs between the groups. However, let’s take the easiest route for now and produce one summary graph of the overall group means. We can account for the baseline weight later, for instance with ANCOVA (analysis of covariance).
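The change-score summary mentioned above could be sketched as follows (my own sketch, not the course code): take each rat's final weight minus its first weight, then average within each diet group.

```r
#Sketch: per-rat change from first to last measurement, averaged by group.
RATS_change <- RATS_l %>%
  group_by(ID) %>%
  summarise(Group = first(Group),
            change = Weight[which.max(Time)] - Weight[which.min(Time)]) %>%
  group_by(Group) %>%
  summarise(mean_change = mean(change))
```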
# First, creating summaries of data by Group and ID with mean as the summary variable.
#
RATS_s2 <- RATS_l %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
glimpse(RATS_s2)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2...
#Drawing a boxplot of the mean versus treatment
ggplot(RATS_s2, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), weeks 1-9")
# Let's remove the most notable outlier, namely the one in Group 2.
# Creating a new data by filtering this one outlier
RATS_s2_no <- RATS_s2 %>% filter(mean<570)
dim(RATS_s2_no)
## [1] 15 3
#Let's draw the above plot again without the outlier
ggplot(RATS_s2_no, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), weeks 1-9")
The first boxplots show three outliers, the one in group 2 being the most notable. Indeed, the weight values are not normally distributed in group 2 in the first plot, and the weights also appear at least somewhat skewed in group 3.
Removing the outlier from group 2 makes a big difference: the variance in that group is now much smaller and the weights appear normally distributed. It would likely have been best to also delete the other outliers, particularly the one in group 3. I realized this a bit too late…
The boxplots also clearly show group differences (though these are based on overall means).
# First, let's apply an analysis of variance (Anova) to examine the groups' differences. Let's use the data without the one outlier that was removed.
res.aov <- aov(mean ~ Group, data = RATS_s2_no)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 206390 103195 483.6 3.39e-12 ***
## Residuals 12 2561 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#At least some of the groups differ (significant F test, p < .001)
#multiple comparisons, Bonferroni corrected p-value
pairwise.t.test(RATS_s2_no$mean, RATS_s2_no$Group, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: RATS_s2_no$mean and RATS_s2_no$Group
##
## 1 2
## 2 8.7e-10 -
## 3 4.7e-12 5.4e-05
##
## P value adjustment method: bonferroni
#The p-values show that all groups differ from each other in mean weight (Bonferroni-corrected p < .001)
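An alternative sketch (my own addition): Tukey's HSD on the fitted ANOVA gives the same pairwise comparisons together with confidence intervals for the mean differences.

```r
#Sketch: Tukey's honestly significant difference test on the ANOVA fit,
#yielding pairwise mean differences with 95% confidence intervals.
TukeyHSD(res.aov)
```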
# Let's take into account the first mean weight values in each group and run an Ancova.
#First computing the summary data without the first measurement occasion.
RATS_s3 <- RATS_l %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight)) %>%
ungroup()
#removing the one outlier
RATS_s3_no <- RATS_s3 %>% filter(mean<570)
#loading in the original data in a wide format
RATS_wide <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",sep ="\t",header=TRUE)
# Adding the baseline weights from the original data as a new variable to the summary data
#baseline <- RATS_wide$WD1
#RATS_s3_no <- RATS_s3_no %>%
#mutate(baseline) # does not work since the row numbers differ; the DataCamp code didn't work for this either
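One way the baseline could be attached (a sketch of my own, not the DataCamp approach): join on ID instead of relying on row order, so the differing row counts no longer matter; the baseline can then enter the model as a covariate.

```r
#Sketch: attach baseline weight (WD1) by matching on ID, then fit
#the ANCOVA-style model with baseline as a covariate.
baseline_df <- RATS_wide %>%
  mutate(ID = factor(ID)) %>%
  select(ID, baseline = WD1)
RATS_s3_bl <- left_join(RATS_s3_no, baseline_df, by = "ID")
fit_bl <- lm(mean ~ baseline + Group, data = RATS_s3_bl)
anova(fit_bl)
```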
# Fitting a linear model with the mean as the response
fit <- lm(mean ~ Group, data = RATS_s2_no) #should add baseline but can't now as it's not in the data.
# Computing the analysis of variance table for the fitted model with anova()
summary(fit)
##
## Call:
## lm(formula = mean ~ Group, data = RATS_s2_no)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.886 -3.080 3.273 9.250 14.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 263.716 5.165 51.06 2.09e-15 ***
## Group2 185.739 9.890 18.78 2.90e-10 ***
## Group3 262.080 8.945 29.30 1.56e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.61 on 12 degrees of freedom
## Multiple R-squared: 0.9877, Adjusted R-squared: 0.9857
## F-statistic: 483.6 on 2 and 12 DF, p-value: 3.387e-12
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 206390 103195 483.6 3.387e-12 ***
## Residuals 12 2561 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The ANOVA (without baseline) and regression show the same results as above. I'm interested to see how the baseline should have been added!
A note from the book: “Baseline measurements of the outcome variable in a longitudinal study are often correlated with the chosen summary measure and using such measures in the analysis can often lead to substantial gains in precision when used appropriately as a covariate in an analysis of covariance.”
A short description of the data, taken from the book: The data is about “40 male subjects who were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.” I am assuming that higher score = higher chance of schizophrenia, or at least more schizophrenic symptoms.
#First, let's ignore the repeated-measures in the data and treat each observation as independent.
#Print out a summary table of all variables
my_table2 <- tbl_summary(BPRS_l,
digits = all_continuous() ~ 2,
type = all_continuous() ~ "continuous2",
statistic = list(all_continuous() ~ c("{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}")))
#Add some missing elements to the table and print it out
my_table2 %>% bold_labels()
| Characteristic | N = 360 |
|---|---|
| treatment | |
| 1 | 180 (50%) |
| 2 | 180 (50%) |
| subject | |
| 1 | 18 (5.0%) |
| 2 | 18 (5.0%) |
| 3 | 18 (5.0%) |
| 4 | 18 (5.0%) |
| 5 | 18 (5.0%) |
| 6 | 18 (5.0%) |
| 7 | 18 (5.0%) |
| 8 | 18 (5.0%) |
| 9 | 18 (5.0%) |
| 10 | 18 (5.0%) |
| 11 | 18 (5.0%) |
| 12 | 18 (5.0%) |
| 13 | 18 (5.0%) |
| 14 | 18 (5.0%) |
| 15 | 18 (5.0%) |
| 16 | 18 (5.0%) |
| 17 | 18 (5.0%) |
| 18 | 18 (5.0%) |
| 19 | 18 (5.0%) |
| 20 | 18 (5.0%) |
| weeks | |
| week0 | 40 (11%) |
| week1 | 40 (11%) |
| week2 | 40 (11%) |
| week3 | 40 (11%) |
| week4 | 40 (11%) |
| week5 | 40 (11%) |
| week6 | 40 (11%) |
| week7 | 40 (11%) |
| week8 | 40 (11%) |
| bprs | |
| Mean (SD) | 37.66 (13.66) |
| Median (IQR) | 35.00 (27.00, 43.00) |
| Range | 18.00, 95.00 |
| week | |
| 0 | 40 (11%) |
| 1 | 40 (11%) |
| 2 | 40 (11%) |
| 3 | 40 (11%) |
| 4 | 40 (11%) |
| 5 | 40 (11%) |
| 6 | 40 (11%) |
| 7 | 40 (11%) |
| 8 | 40 (11%) |
head(BPRS_l,41) #The summary table suggests 20 subjects, but there are actually 40: the subject IDs run from 1 to 20 in both treatment groups, and the groups contain different people!
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
## 7 1 7 week0 71 0
## 8 1 8 week0 30 0
## 9 1 9 week0 41 0
## 10 1 10 week0 57 0
## 11 1 11 week0 30 0
## 12 1 12 week0 55 0
## 13 1 13 week0 36 0
## 14 1 14 week0 38 0
## 15 1 15 week0 66 0
## 16 1 16 week0 41 0
## 17 1 17 week0 45 0
## 18 1 18 week0 39 0
## 19 1 19 week0 24 0
## 20 1 20 week0 38 0
## 21 2 1 week0 52 0
## 22 2 2 week0 30 0
## 23 2 3 week0 65 0
## 24 2 4 week0 37 0
## 25 2 5 week0 59 0
## 26 2 6 week0 30 0
## 27 2 7 week0 69 0
## 28 2 8 week0 62 0
## 29 2 9 week0 38 0
## 30 2 10 week0 65 0
## 31 2 11 week0 78 0
## 32 2 12 week0 38 0
## 33 2 13 week0 63 0
## 34 2 14 week0 40 0
## 35 2 15 week0 40 0
## 36 2 16 week0 54 0
## 37 2 17 week0 33 0
## 38 2 18 week0 28 0
## 39 2 19 week0 52 0
## 40 2 20 week0 47 0
## 41 1 1 week1 36 1
str(BPRS_l)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
#Let's recode the subject values so that every participant has an individual "id". Maybe this helps later on with some errors I got in the plots (it did, yay!)
BPRS_l_test <- BPRS_l %>%
mutate(subject = ifelse(as.character(treatment) == "2", recode(as.character(subject), "1" = 21, "2" = 22, "3" = 23, "4" = 24, "5" = 25, "6" = 26, "7" = 27, "8" =28, "9"=29, "10"=30, "11"=31, "12"=32, "13"=33, "14"=34, "15"=35, "16"=36, "17"=37, "18"=38, "19"=39, "20"=40), subject))
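A more compact alternative to the long recode (a sketch of my own): offset the subject numbers in treatment group 2 by 20 and convert back to a factor.

```r
#Sketch: give treatment group 2 the subject numbers 21-40 directly.
BPRS_l_alt <- BPRS_l %>%
  mutate(subject = factor(as.integer(as.character(subject)) +
                            ifelse(treatment == "2", 20, 0)))
```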
#Let's start with plotting the data. Let's note group partnerships but ignore the longitudinal nature of the data for now.
# Drawing the measures against week plots
#
#First, let's make a scatterplot. And add some jitter to maybe see the overlapping observations a bit better.
p1 <- ggplot(BPRS_l, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_text(aes(label = treatment), position=position_jitter(width=0.15,height=0.1))
p3 <- p2 + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "Measure")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6
# Still ignoring the repeated-measures structure of the data, let's fit a multiple linear regression model where bprs is response and week and treatment are explanatory variables.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRS_l)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
#Let's NOT add a week*treatment interaction term to this model:
#the linear model is already questionable since the data contain repeated measures, i.e., the data points are not independent!
#BPRS_reg2 <- lm(bprs ~ week * treatment, data=BPRS_l)
The two groups have largely overlapping values in the scatter plot, i.e., there do not seem to be any clear differences between the treatment groups.
The linear (regression) model supports this view, since treatment2 does not differ significantly from treatment1 (p > 0.1). Treatment week seems to have a negative effect on symptoms (p < .001), but we need to remember that the data points are not actually independent of each other (they are repeated measurements from the same individuals).
#making a line plot which accounts for the longitudinal structure of the data by joining together the measurement values from same individual.
p7 <- ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject))
p8 <- p7 + geom_line(aes(colour = treatment))
p9 <- p8 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
p10 <- p9 + scale_y_continuous(name = "Measure(bprs)")
p11 <- p10 + theme_bw() + theme(legend.position = "top")
p12 <- p11 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p12
#same plot but with linetype, not colour indicating treatment groups.
ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype=treatment)) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Measure(bprs)") + theme(legend.position = "top")
#Let's make an additional plot.
ggplot(BPRS_l, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRS_l$bprs), max(BPRS_l$bprs)))
#Well, this is not any easier to interpret due to the large number of subjects. Differences are easier to see in the earlier graphs, where both groups appear in the same plot.
The plots show that, indeed, the individuals in the two groups do not form clearly distinct groups with regard to symptom change. There also seems to be at least one outlier in the data (blue, treatment group 2). The plots do show the overall downward trend of the symptoms, though. The variability in symptoms also seems to narrow toward the end, with a smaller range of scores at week 8. The faceted plot further suggests that there may be a bit more symptom variability in group 2, even when ignoring the one outlier in that group.
#loading in the data in a wide format:
BPRS_wide <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",sep =" ",header=TRUE)
#Let's plot the bprs-values per week
pairs(BPRS_wide[,3:11], cex = 0.7)
#The plot shows that the measurements are related; there is an observable trend in many of the panels.
#Let's plot the bprs-values per week with treatment group coloured
pairs(BPRS_wide[,2:11], col=BPRS_wide$treatment, cex = 0.7)
library(lme4)
#First, let's fit a random intercept model and include the two explanatory variables: week and treatment.
#Fitting the random intercept model with the subject as the random effect
BPRS_rim <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_l, REML = FALSE)
# Print the summary of the model
summary(BPRS_rim)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS_l
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
#treatment group has no notable effect on symptoms (t = 0.53).
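Since lmer() does not report p-values for the fixed effects, approximate confidence intervals are one way to judge them (a sketch of my own, not part of the course code).

```r
#Sketch: approximate (Wald) confidence intervals for the fixed effects
#of the random intercept model.
confint(BPRS_rim, method = "Wald")
```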
#Next, let's fit a random intercept AND random slope model
#Let's set week and subject as the random effects
BPRS_riasm <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_l, REML = FALSE)
# print a summary of the model
summary(BPRS_riasm)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS_l
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# performing an ANOVA test to compare the two models
anova(BPRS_riasm, BPRS_rim)
## Data: BPRS_l
## Models:
## BPRS_rim: bprs ~ week + treatment + (1 | subject)
## BPRS_riasm: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_rim 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_riasm 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The model that also includes a random slope for each individual fits the data better, i.e., predicts symptom change better. This can be seen from the chi-squared test and the lower AIC (smaller is better; note that BIC actually slightly favors the simpler model). However, the difference between the models is not large (p = .03).
# Fitting a random intercept and slope model with a treatment × week interaction. That is, making the same model as previously but with interaction.
BPRS_riasmi <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS_l, REML = FALSE)
summary(BPRS_riasmi)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRS_l
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
#comparing the last two models with Anova
# Computing the analysis of variance tables of the two models
anova(BPRS_riasmi, BPRS_riasm)
## Data: BPRS_l
## Models:
## BPRS_riasm: bprs ~ week + treatment + (week | subject)
## BPRS_riasmi: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_riasm 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_riasmi 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Making a plot of BPRS_l with the observed bprs values
ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Score (bprs)") +
theme(legend.position = "top")
# Creating a vector of the fitted values
Fitted <- fitted(BPRS_riasmi)
# Creating a new column fitted to BPRS_l_test2
BPRS_l_test2 <- BPRS_l_test %>%
mutate(Fitted)
# draw the plot of BPRS with the Fitted values of bprs
#instead of linetype, let's use colour = treatment.
ggplot(BPRS_l_test2, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(colour = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Fitted bprs") +
theme(legend.position = "top")
#Let's plot the observed and fitted values next to each other:
px1 <- ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject))
px2 <- px1 + geom_line(aes(linetype = treatment))
px3 <- px2 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
px4 <- px3 + scale_y_continuous(name = "bprs")
px5 <- px4 + theme_bw() + theme(legend.position = "right")
px6 <- px5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
px7 <- px6 + ggtitle("Observed")
graph1 <- px7
py1 <- ggplot(BPRS_l_test2, aes(x = week, y = Fitted, group = subject))
py2 <- py1 + geom_line(aes(linetype = treatment))
py3 <- py2 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
py4 <- py3 + scale_y_continuous(name = "bprs")
py5 <- py4 + theme_bw() + theme(legend.position = "right")
py6 <- py5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
py7 <- py6 + ggtitle("Fitted")
graph2 <- py7
graph1; graph2
#Again, no clear differences between the groups.
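As a final sketch (my own addition): a quick numeric check of overall model fit is the correlation between the observed and fitted bprs values from the interaction model.

```r
#Sketch: correlation between observed bprs and the fitted values
#stored in BPRS_l_test2 above.
cor(BPRS_l_test2$bprs, BPRS_l_test2$Fitted)
```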